Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
Similar resources
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate direct...
Learning Spoken Words from Multisensory Input
Speech recognition and speech translation are traditionally addressed by processing acoustic signals while nonlinguistic information is typically not used. In this paper, we present a new method which explores spoken word learning from naturally co-occurring multisensory information in a dyadic (two-person) conversation. It has been noticed that the listener always has a strong tendency to l...
Interference of spoken word recognition through phonological priming from visual objects and printed words.
Three cross-modal priming experiments examined the influence of preexposure to pictures and printed words on the speed of spoken word recognition. Targets for auditory lexical decision were spoken Dutch words and nonwords, presented in isolation (Experiments 1 and 2) or after a short phrase (Experiment 3). Auditory stimuli were preceded by primes, which were pictures (Experiments 1 and 3) or th...
Learning words from natural audio-visual input
We present a model of early word learning which learns from natural audio and visual input. The model has been successfully implemented to learn words and their audio-visual grounding from camera and microphone input. Although simple in its current form, this model is a first step towards a more complete, fully-grounded model of language acquisition. Practical applications include adaptive human-...
Differences between written and spoken input in learning new words
We trained adult learners the meanings of rare words to test hypotheses about modality effects in learning word forms. These hypotheses are that (1) written (orthographic) training leads to a better representation of word form than phonological training, that (2) recognition memory for a word is partly dependent upon congruence between training and testing modality (written vs. spoken) but that...
Journal
Journal title: International Journal of Computer Vision
Year: 2019
ISSN: 0920-5691,1573-1405
DOI: 10.1007/s11263-019-01205-0